Static Jailbreak Attacks

Overview

Do Anything Now: DAN attacks are adversarial prompts that are crafted to bypass safeguards and manipulate LLMs to generate harmful text. These attacks are typically super lengthy prompts that request the model to role-play as a character, such as "Do-Anything-Now", that will listen to the user's requests.

Encoding: Encoding attacks are a set of attacks that attempt to subvert safeguards by encoding malicious user instructions in non-human readable formats. The basic assumption behind this class of attacks is that the safety guardrails are typically trained to perform well with English or human-readable instruction.

Persuasive Adversarial Prompts: PAP builds upon persuasive techniques developed in social science research to convince an LLM to break a set of rules. DynamoFL applies these persuasive techniques to attack the language model.

Greedy Coordinate Gradient: The GCG attack constructs adversarial suffixes designed to make LLMs more prone to start their responses with affirmative utterances, such as "Sure, here's…". Malicious prompts are sent to the model with these keywords added at the end. Studies have shown that many modern LLMs are susceptible to the same adversarial prefixes given that they are trained on similar pre-training data. DynamoFL runs a set of GCG attacks on open source models, caches these suffixes, and applies these to the user's target model.

Low Resource Languages: LRL is basically languages that have relatively less training data available on the internet. Studies have shown that providing adversarial prompts translated to low resource languages yields a high jailbreaking success rate on models such as GPT-4. This is likely due to the fact that many language models are aligned using high-resource languages like English, and neglect aligning LLMs on lower resource languages.

ASCII Art Attack: A form of text-based art, which enables creation of visual images using only the characters found on a standard ASCII keyboard. We encode adversarial instructions into ASCII art. Previous studies have shown that this is able to evade guardrails.

Metrics

Attack success rate (ASR) refers to the percentage of attempted jailbreaks that are successfully executed by an attacker.

Overview​

Metrics​

Overview

Metrics